Skip to content

ARROW-12379: [C++] Fix ThreadSanitizer failure in SerialExecutor#10025

Closed
pitrou wants to merge 1 commit into
apache:masterfrom
pitrou:ARROW-12379-tsan
Closed

ARROW-12379: [C++] Fix ThreadSanitizer failure in SerialExecutor#10025
pitrou wants to merge 1 commit into
apache:masterfrom
pitrou:ARROW-12379-tsan

Conversation

@pitrou

@pitrou pitrou commented Apr 14, 2021

Copy link
Copy Markdown
Member

No description provided.

@pitrou pitrou requested a review from lidavidm April 14, 2021 14:29
@pitrou

pitrou commented Apr 14, 2021

Copy link
Copy Markdown
Member Author

cc @westonpace

@pitrou

pitrou commented Apr 14, 2021

Copy link
Copy Markdown
Member Author

I've checked locally that this fixes the TSan failures.

@pitrou

pitrou commented Apr 14, 2021

Copy link
Copy Markdown
Member Author

@lidavidm lidavidm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the issue was that we might be in the middle of SpawnReal on an external thread while the main thread was destructing the state (e.g. it exited early due to an error or something)?

@pitrou

pitrou commented Apr 14, 2021

Copy link
Copy Markdown
Member Author

Yes.

@pitrou

pitrou commented Apr 14, 2021

Copy link
Copy Markdown
Member Author

Hm, wait, I shouldn't unlock before notifying in MarkFinished().

@lidavidm

Copy link
Copy Markdown
Member

Additionally, since we're handing around Executor* instead of shared_ptr or something, there's still the possibility that an I/O thread tries to use a dangling reference, right? The main thread has to make sure to block and wait for all I/O threads it spawns under all conditions.

@pitrou

pitrou commented Apr 14, 2021

Copy link
Copy Markdown
Member Author

Well, the idea is that the top-level task waits for all dependent tasks to finish, I think.

@lidavidm

Copy link
Copy Markdown
Member

True, it's just a potentially easy pitfall (though it comes with the territory I suppose).

@pitrou

pitrou commented Apr 14, 2021

Copy link
Copy Markdown
Member Author

Things may get tricky if cancellation or early exit is involved indeed...

@pitrou pitrou force-pushed the ARROW-12379-tsan branch from 9a07ed0 to 6e35a3f Compare April 14, 2021 14:58
@lidavidm

Copy link
Copy Markdown
Member

Seems Crossbow built Arrow master, not this branch. I threw up https://github.com/lidavidm/crossbow/actions/runs/748753207

@lidavidm

Copy link
Copy Markdown
Member

The crossbow job passes!

@pitrou

pitrou commented Apr 14, 2021

Copy link
Copy Markdown
Member Author

@lidavidm

Copy link
Copy Markdown
Member

Seemed the integration build timed out in Java but everything else passed.

@lidavidm lidavidm closed this in cb7a624 Apr 14, 2021
@pitrou pitrou deleted the ARROW-12379-tsan branch April 14, 2021 16:21
@westonpace

westonpace commented Apr 14, 2021

Copy link
Copy Markdown
Member

Thanks @pitrou . A few thoughts (nothing worrisome):

Why deque instead of queue? I don't mind the change. I just want to make sure I'm not missing some piece of knowledge.

Things may get tricky if cancellation or early exit is involved indeed...\

Cancellation should still flush the task queue. Early exit (via bad Status) should still flush the task queue. It is important we do because in both cases the remaining tasks might need to do some cleanup to handle the error. Exceptions/assertions might not but we're already in crash territory there.

@pitrou

pitrou commented Apr 14, 2021

Copy link
Copy Markdown
Member Author

It's only for consistency, since deque is used for the threadpool executor.

@github-actions

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants